DOMAIN :- Healthcare

Import and warehouse data

Task: Import all the given datasets and explore shape and size of each.

Task: Merge all datasets into one and explore the final shape and size
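A minimal sketch of the import and merge steps, assuming pandas and assuming the files share the same columns so that a row-wise concatenation is the appropriate merge. The three small frames stand in for the imported files; their names, columns, and values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the imported files (in practice: pd.read_csv(...)).
normal = pd.DataFrame({"P_incidence": [38.5, 54.9], "Class": ["Normal", "Normal"]})
type_h = pd.DataFrame({"P_incidence": [63.0, 39.1], "Class": ["Type_H", "Type_H"]})
type_s = pd.DataFrame({"P_incidence": [74.4, 89.7, 44.5], "Class": ["Type_S"] * 3})

frames = {"normal": normal, "type_h": type_h, "type_s": type_s}

# Shape is (rows, columns); size is rows * columns.
for name, frame in frames.items():
    print(name, frame.shape, frame.size)

# Stack the frames row-wise and rebuild a clean 0..n-1 index.
merged = pd.concat(list(frames.values()), ignore_index=True)
print("merged", merged.shape, merged.size)
```

If the files instead shared a key column and carried different attributes, a key-based `DataFrame.merge` would be the better fit.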

Data cleansing

Task: Explore and if required correct the datatypes of each attribute

Check for null values in the attributes and, if required, drop or impute them

The same classes appear in different formats, so all variants of a class are converted to one format

There are no null values in the datasets. The differently formatted class labels were mapped to the common class names Normal, H_type and S_type
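The null check and label clean-up can be sketched as follows, assuming pandas. The raw label variants and the column name `Class` are hypothetical; only the three target names come from the text above.

```python
import pandas as pd

# Hypothetical raw labels: the same three classes written in different formats.
df = pd.DataFrame({"Class": ["Normal", "Nrmal", "type_h", "Type_H", "tp_s", "Type_S"]})

# Count missing values per column before deciding whether to drop or impute.
null_counts = df.isnull().sum()
print(null_counts)

# Map every variant to one common name; the keys here are illustrative only.
label_map = {
    "Nrmal": "Normal",
    "type_h": "H_type", "Type_H": "H_type",
    "tp_s": "S_type", "Type_S": "S_type",
}
df["Class"] = df["Class"].replace(label_map)
print(df["Class"].value_counts())
```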

One-hot encoding of the categorical text column
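As an illustration of one-hot encoding, the sketch below uses `pandas.get_dummies` on a toy frame. The column names and values are assumptions made for the example.

```python
import pandas as pd

# Toy frame with one numeric feature and one categorical text column.
df = pd.DataFrame({
    "S_Degree": [1.2, 40.5, 3.3],
    "Class": ["Normal", "S_type", "H_type"],
})

# Each distinct class becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["Class"], dtype=int)
print(encoded)
```

Note that most scikit-learn classifiers accept string class labels directly, so encoding the target is only needed when a particular model or plot requires numeric values.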

Data analysis & visualisation

Along the diagonal we can see the distribution of each individual variable.

P_incidence has a positive relationship with all variables except P_radius. The relationship is strongest with S_slope and L_angle.

P_tilt has a stronger relationship with P_incidence and L_angle. It shows no relationship with S_slope and P_radius.

L_angle has a positive relationship with P_tilt, S_slope and S_degree. It has no relationship with P_radius.

S_slope has a positive relationship with L_angle and S_degree.

P_radius has no relationship with S_degree, P_tilt or L_angle.

S_degree has no strong positive relationship with any of the variables.

It is clear that S_Degree of Type_S contains larger values.

P_incidence is positively correlated with S_Slope and L_angle

P_radius is negatively correlated with P_incidence, L_angle and S_slope

Box Plot: Univariate Analysis

It is observed that there are many outliers present in P_tilt, P_radius, S_Degree
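Box plots conventionally flag points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule numerically with NumPy so the outliers can be counted rather than read off the plot. The sample values are made up, and the 1.5 factor is the usual default rather than something stated in this analysis.

```python
import numpy as np

def iqr_outliers(values, whisker=1.5):
    """Return the values lying outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return values[(values < lower) | (values > upper)]

# Made-up sample with one obviously extreme value.
sample = [10, 12, 11, 13, 12, 11, 14, 60]
print(iqr_outliers(sample))
```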

Bivariate Analysis

Ranges of the variable

Let us group the continuous values into categories to analyse the distribution of features across classes

Binning
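A minimal sketch of the binning step, assuming pandas and four equal-width bins per feature. The bin labels, the toy values, and the choice of equal-width rather than quantile bins are assumptions.

```python
import pandas as pd

# Toy data; values and class assignments are illustrative only.
df = pd.DataFrame({
    "P_incidence": [30, 40, 45, 55, 60, 70, 85, 95, 110, 126],
    "Class": ["Normal", "Normal", "Type_H", "Normal", "Type_H",
              "Type_S", "Type_S", "Type_S", "Type_S", "Type_S"],
})

# Cut the continuous feature into 4 equal-width groups.
df["P_incidence_bin"] = pd.cut(
    df["P_incidence"], bins=4, labels=["low", "medium", "high", "very_high"]
)

# Cross-tabulate bins against the class to see how classes spread across groups.
table = pd.crosstab(df["P_incidence_bin"], df["Class"])
print(table)
```

A stacked bar chart of `table` gives the kind of per-bin class distribution that the observations in this section describe.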

All cases fall into 4 groups, but as the P_incidence value increases, the majority of cases are Type S.

Higher P_incidence --> most likely the case will be Type-S

From the above chart --> higher values of S_Slope have no Type_H cases

Lower S_Slope value --> more Type-S cases than Normal cases

Lower P_radius value --> Higher Type S cases

Higher S_Degree value --> more number of Type S cases

Lower S_Degree value --> more number of Normal cases

Summary

Data Pre-processing

Segregate predictors vs target attributes

Performing normalisation or scaling
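The two pre-processing steps above might look like the following sketch. It assumes scikit-learn's `StandardScaler` (z-score scaling); the text does not say which scaler was used, and the column names and values are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame; in the real notebook this is the merged, cleaned dataset.
df = pd.DataFrame({
    "P_incidence": [40.0, 60.0, 80.0, 100.0],
    "S_Degree": [2.0, 10.0, 40.0, 90.0],
    "Class": ["Normal", "Type_H", "Type_S", "Type_S"],
})

# Segregate predictors (X) from the target attribute (y).
X = df.drop(columns=["Class"])
y = df["Class"]

# Rescale each predictor to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
print(X_scaled.round(3))
```

Scaling matters for KNN because the distance computation would otherwise be dominated by the features with the largest numeric range.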

Percentage of the data

Observation: The class distribution among Normal, Type H and Type S in the dataset is roughly 2:1:3 respectively

Train and Test Split
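Because the classes are unevenly distributed, a stratified split keeps the same class ratio in both partitions. The sketch below assumes scikit-learn's `train_test_split`; the 70/30 ratio, the random seed, and the toy data are assumptions.

```python
from sklearn.model_selection import train_test_split

# Toy data in a 2:1:3 class ratio, mirroring the observation above.
X = [[i] for i in range(60)]
y = ["Normal"] * 20 + ["Type_H"] * 10 + ["Type_S"] * 30

# stratify=y preserves the class proportions in the train and test partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(len(X_train), len(X_test))
```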

Model Building

KNN Classifier

Choosing Optimal K value

Performance of our model with training data

Performance of our model with testing data

Testing data has slightly higher accuracy than training data

Observation:

Across the 3 types of cases, our model's F1-score is around 86%

Error Rate calculation

Accuracy for Minimum error rate
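The error-rate calculation can be sketched as a loop over candidate K values, recording the share of misclassified test samples for each. The synthetic data from `make_blobs`, the range of K, and the split settings are assumptions used only to keep the example runnable.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the scaled predictors and target.
X, y = make_blobs(n_samples=300, centers=3, n_features=6,
                  cluster_std=3.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Error rate = fraction of test samples whose predicted class is wrong.
k_values = list(range(1, 31, 2))
error_rates = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rates.append(float(np.mean(model.predict(X_test) != y_test)))

best_k = k_values[int(np.argmin(error_rates))]
print("K with minimum error:", best_k)
print("Accuracy at that K:", 1 - min(error_rates))
```

Picking K directly on the test set tends to make the reported test accuracy optimistic; a separate validation split or cross-validation avoids that.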

Possible improvement through hyperparameter tuning
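One way to tune KNN is a cross-validated grid search. The sketch below assumes scikit-learn's `GridSearchCV` with a scaling pipeline. The parameter grid, the macro-F1 scoring, and the synthetic data are illustrative choices, not the settings used in this analysis.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the real predictors and target.
X, y = make_blobs(n_samples=300, centers=3, n_features=6,
                  cluster_std=3.0, random_state=0)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

param_grid = {
    "kneighborsclassifier__n_neighbors": [3, 5, 7, 9, 11, 13, 15],
    "kneighborsclassifier__weights": ["uniform", "distance"],
    "kneighborsclassifier__p": [1, 2],  # 1 = Manhattan, 2 = Euclidean distance
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_macro")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```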

Observation

All the variables have a significant effect on the target class

The Type_S class has a higher mean value for almost all variables

The Normal class has lower values for all variables

For almost all variables the distribution is normal

For KNN with k=13 we get balanced train and test error

Suggestion

Because this is a medical domain, a clear description of each variable would help in understanding the problem statement better

PART 2 -- DOMAIN :- Banking and Finance

Import and warehouse data

Task: Import all the given datasets and explore shape and size of each

Task: Merge all datasets into one and explore the final shape and size

Data Cleaning

Data analysis & visualisation:

Most customers in the dataset are aged between 40 and 60

The largest group of retained customers has been with this bank for about 20-30 years

Observation

1. Age and Experience are highly correlated, which is quite intuitive; Experience will be dropped from further analysis.

2. Some variables show no correlation with the others.

3. Loan on Card is correlated with the customer's highest spend.

4. Monthly average spend is correlated with highest spend.
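The first observation can be checked programmatically by listing feature pairs whose absolute correlation exceeds a threshold and dropping one of each pair. The sketch assumes pandas. The 0.9 threshold, the column names, and the toy values are assumptions.

```python
import pandas as pd

# Toy data in which Age and Experience move almost in lockstep.
df = pd.DataFrame({
    "Age": [25, 35, 45, 55, 65],
    "Experience": [1, 10, 20, 30, 40],
    "MonthlyAverageSpend": [1.5, 0.4, 2.8, 0.9, 1.6],
})

# Absolute Pearson correlation between every pair of columns.
corr = df.corr().abs()
threshold = 0.9

# Walk the upper triangle so each pair is reported once.
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(high_pairs)

# Keep Age and drop its near-duplicate.
df_reduced = df.drop(columns=["Experience"])
print(df_reduced.columns.tolist())
```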

Box Plot: Univariate Analysis

Further Analysis on Dataset

Customers without a securities account are more likely to respond positively

Customers without an FDA account are more likely to respond positively

Customers with InternetBanking are more likely to respond positively

Customers who do not hold a credit card provided by the bank are more likely to respond positively to the survey

Summary

Customers without a mortgage are likely to respond positively

Data pre-processing:

Segregate predictors vs target attributes

There is an imbalance in the target field: only 9.5% of customers have borrowed a loan from the bank
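The imbalance can be quantified with a normalised value count. The sketch assumes pandas; the target column name `LoanOnCard` is hypothetical, and the toy target is constructed to mirror the 9.5% figure above rather than taken from the real data.

```python
import pandas as pd

# Toy target built to reproduce a 9.5% positive rate; not the real dataset.
df = pd.DataFrame({"LoanOnCard": [1] * 95 + [0] * 905})

counts = df["LoanOnCard"].value_counts()
shares = df["LoanOnCard"].value_counts(normalize=True)

print(counts.to_dict())
print((shares * 100).round(1).to_dict())
```

With a minority class this small, accuracy alone can look high even for a model that rarely predicts the positive class, so per-class precision and recall are worth reporting alongside it.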

Perform train-test split.

Model training, testing and tuning:

Logistic Regression:

Age, Security, InternetBanking and Creditcard have a negative impact on the log-odds of a positive response
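To make the log-odds statement concrete: in logistic regression a coefficient b changes the log-odds by b per unit of the feature, which multiplies the odds by exp(b). A negative coefficient therefore gives an odds ratio below 1. The sketch uses only the standard library; the coefficient and baseline values are hypothetical, not the fitted ones.

```python
import math

def odds_ratio(coefficient, delta=1.0):
    """Multiplicative change in the odds when the feature increases by delta."""
    return math.exp(coefficient * delta)

def probability(log_odds):
    """Convert log-odds back to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical numbers chosen for illustration.
baseline_log_odds = -2.0
coefficient = -0.5

print("odds ratio:", round(odds_ratio(coefficient), 4))
print("p before:", round(probability(baseline_log_odds), 4))
print("p after :", round(probability(baseline_log_odds + coefficient), 4))
```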

Performance of our model with training data

Performance of our model with testing data

The overall accuracy of the model is 94%

Naive Bayes:

Performance of our model with training data

Performance of our model with testing data

New tuning method --> here we convert all numerical (continuous) features to categorical

Comparison Result

Note:

Gaussian Naive Bayes 1 is the model fitted without discretization

Gaussian Naive Bayes 2 is the model fitted with discretization (converting numerical features to categorical)
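The comparison described in the note can be sketched as below, assuming scikit-learn. `KBinsDiscretizer` with five quantile bins is an assumed way of converting numerical features to categories, and the data is synthetic, so the printed scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic imbalanced binary data standing in for the banking dataset.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Model 1: Gaussian Naive Bayes on the raw continuous features.
nb_raw = GaussianNB().fit(X_train, y_train)

# Model 2: same classifier on ordinal bin codes (bins learned on the training set only).
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_train_binned = binner.fit_transform(X_train)
X_test_binned = binner.transform(X_test)
nb_binned = GaussianNB().fit(X_train_binned, y_train)

scores = {
    "Gaussian Naive Bayes 1 (raw)": nb_raw.score(X_test, y_test),
    "Gaussian Naive Bayes 2 (discretized)": nb_binned.score(X_test_binned, y_test),
}
for name, accuracy in scores.items():
    print(name, round(accuracy, 3))
```

Once the features are truly categorical, scikit-learn's `CategoricalNB` is an alternative worth comparing, since the Gaussian model still treats bin codes as continuous values.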

Insights

1. Customers without a credit card provided by the bank, and without an FDA, securities account or mortgage, are potential targets.

2. Customers with access to online banking are potential targets.

3. Age, Security, InternetBanking and Creditcard have a negative impact on the log-odds of a positive response.

Suggestion:

1. The data should include family members, income and education to support better analysis.

2. The data should include vehicles owned by the customer.